[DSv4][Nvidia] SM12x DeepSeek V4 support #40991
jasl wants to merge 11 commits into vllm-project:main from
Conversation
@WoosukKwon
Code Review
This pull request introduces support for DeepSeek V4 models, including updates to DeepGEMM integration, new FP8 einsum kernels for SM12x, and infrastructure for sparse MLA attention. However, there are two critical issues: the removal of the optional dependency check for tilelang in vllm/model_executor/layers/mhc.py will break installations on non-CUDA platforms, and the replacement of DeepseekV4MLP with DeepseekV2MLP for shared experts removes necessary swiglu_limit clamping, which is vital for numerical stability in FP8 inference.
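To make the numerical-stability point concrete, here is a minimal sketch of SwiGLU with an activation clamp. It is an illustration only: the exact placement of the clamp and the default limit value in DeepseekV4MLP are assumptions, not the PR's actual code.

```python
import torch
import torch.nn.functional as F


def swiglu_with_limit(gate_up: torch.Tensor, swiglu_limit: float | None = 7.0) -> torch.Tensor:
    # Hypothetical illustration, not the DeepseekV4MLP implementation: split the
    # fused gate/up projection, clamp both halves to +/- swiglu_limit, then apply
    # SiLU gating. The clamp keeps activations within a range that FP8 formats
    # (e4m3/e5m2) can represent without overflowing.
    gate, up = gate_up.chunk(2, dim=-1)
    if swiglu_limit is not None:
        gate = gate.clamp(-swiglu_limit, swiglu_limit)
        up = up.clamp(-swiglu_limit, swiglu_limit)
    return F.silu(gate) * up
```

Dropping the clamp (as happens when the shared experts fall back to DeepseekV2MLP) removes that bound, which is why the review flags it as a risk for FP8 inference.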
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 4e2adf8a9f
The PR is ready to review.
Thanks to your work, the context length I can serve locally has increased significantly, and decode speed has improved a lot. It's amazing!
@jasl My understanding is that your current approach supports SM120 through a combination of DeepGEMM and Triton. I wonder whether a pure Triton implementation, without depending on DeepGEMM at all, would be cleaner and perhaps worth considering as an alternative. I’d be interested to hear your thoughts.
I don't have a preference. |
Co-authored-by: OpenAI Codex <codex@openai.com> Signed-off-by: jasl <jasl9187@hotmail.com>
Protect hybrid-aligned DeepSeek V4 MLA prompt cache blocks so they survive decode and unrelated long-session cache churn. Keep common-prefix accounting aware of the extra protection reference and cover compressor-state SlidingWindowMLA groups in a regression test. Co-authored-by: OpenAI Codex <codex@openai.com> Signed-off-by: jasl <jasl9187@hotmail.com>
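To make the protection scheme in this commit easier to picture, here is a toy sketch. All names and structure are assumptions rather than vLLM's actual KV-cache manager API: a protected block carries one extra reference so eviction cannot reclaim it, and common-prefix accounting subtracts that reference so sharing statistics stay correct.

```python
from dataclasses import dataclass


@dataclass
class Block:
    block_id: int
    ref_count: int = 0


class ProtectedBlockPool:
    """Toy sketch of the protection idea described above; not vLLM's real code."""

    def __init__(self, num_blocks: int) -> None:
        self.blocks = [Block(i) for i in range(num_blocks)]
        self.protected: set[int] = set()

    def protect(self, block: Block) -> None:
        # Add one extra "protection" reference so decode or unrelated
        # long-session cache churn cannot evict the block while it is pinned.
        if block.block_id not in self.protected:
            block.ref_count += 1
            self.protected.add(block.block_id)

    def unprotect(self, block: Block) -> None:
        if block.block_id in self.protected:
            block.ref_count -= 1
            self.protected.discard(block.block_id)

    def evictable(self, block: Block) -> bool:
        # Reclaimable only when neither a request nor a protection holds it.
        return block.ref_count == 0

    def shared_prefix_refs(self, block: Block) -> int:
        # Common-prefix accounting must not count the protection reference
        # as an active user of the block.
        return block.ref_count - (1 if block.block_id in self.protected else 0)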
Signed-off-by: jasl <jasl9187@hotmail.com>
@jasl The engine crashed with the following errors while running the latest ds4-sm120-full after a few hours of random usage. Unfortunately I cannot replicate it. By the way, would you mind enabling Issues on your repository so we can report problems in a more structured way?
Signed-off-by: jasl <jasl9187@hotmail.com>
Verified on 2x A5000 sm_86: MLA attention + DeepSeekMoE + bf16 produces 'The capital of France is Paris. The official language is French...' at PP=2. Triton sparse-MLA from PR vllm-project#40991 + the sm_8x gate works on Ampere. Confirms rung 2 of the model ladder for V4-Flash on Ampere.
I'm doing a long-running smoke test; not sure I can trigger it.
@v1b3coder
I don't see Issues available at https://github.com/jasl/vllm. Maybe you have another repo in mind?
I needed to enable the Issues feature myself; you should see it now.
Route SM12x sparse MLA decode metadata around DeepGEMM scheduler metadata instead of returning placeholder metadata. Let get_paged_mqa_logits_metadata call the backend normally so unexpected SM12x metadata calls fail through the backend. Also keep SM12x FP8 MQA and paged MQA local fallback dispatch from initializing DeepGEMM before the SM12x guard runs. Co-authored-by: OpenAI Codex <codex@openai.com> Signed-off-by: jasl <jasl9187@hotmail.com>
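As a rough illustration of the dispatch-ordering point in this commit, the SM12x guard runs before anything touches DeepGEMM. The module paths, the Triton fallback name, and the DeepGEMM entry point shown here are assumptions, not the actual vllm/utils/deep_gemm.py code:

```python
import torch


def _is_sm12x() -> bool:
    # Assumed mapping: SM 12.x reports compute capability major == 12.
    major, _minor = torch.cuda.get_device_capability()
    return major == 12


def fp8_paged_mqa_logits(*args, **kwargs):
    # Sketch only: run the SM12x guard first so the local Triton fallback
    # never initializes DeepGEMM on GPUs it does not support.
    if _is_sm12x():
        from my_triton_fallbacks import fp8_paged_mqa_logits_triton  # hypothetical module
        return fp8_paged_mqa_logits_triton(*args, **kwargs)
    # Only reached on Hopper / datacenter Blackwell, where DeepGEMM is usable.
    import deep_gemm  # assumed entry point, per the dispatch names in this PR
    return deep_gemm.fp8_paged_mqa_logits(*args, **kwargs)
```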
Thank you, much appreciated. It crashed for a second time now; I opened an issue.
Five additional gates extended from SM12x-only to also include sm_8x (Ampere/Ada). The PR vllm-project#40991 author noted that 'SM80/86/89 architectures could theoretically use identical Triton approaches'; this validates that claim across the V4-Flash forward path:
- vllm/utils/deep_gemm.py:
  * fp8_mqa_logits dispatch (line ~626)
  * fp8_paged_mqa_logits dispatch (line ~924)
  * tf32_hc_prenorm_gemm dispatch (line ~1004)
  All three Triton kernels live in deepseek_v4_triton_kernels.py and are sm_80+ portable. The DeepGEMM fallback is Hopper/Blackwell-only and raises _missing() on Ampere; without these gates, V4 attention dies on the first token.
- vllm/model_executor/layers/deepseek_v4_attention.py: _use_deepseek_v4_sm12x_triton_fp8_einsum widened to capability.major in (8, 12). The DeepGEMM equivalent isn't available on Ampere.
- vllm/model_executor/layers/mhc.py: hyperconnections (mhc_pre/post/hc_head) used TileLang JIT on CUDA. TileLang requires sm_89+ and fails to JIT-compile on sm_80/sm_86. A new helper _should_use_mhc_torch_fallback() routes the torch reference impl on Ampere and ROCm. Numerically equivalent, ~1.5-2x slower.
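A compact sketch of the widened gates described above. The helper names come from the comment; the bodies are assumptions about how such gates are typically written:

```python
import torch


def _should_use_mhc_torch_fallback() -> bool:
    # Route the hyperconnection layers to the torch reference implementation
    # on ROCm and on pre-sm_89 CUDA GPUs, where the TileLang JIT path is not
    # available. Sketch only; the real helper may check more conditions.
    if not torch.cuda.is_available():
        return True
    if torch.version.hip is not None:  # ROCm build
        return True
    return torch.cuda.get_device_capability() < (8, 9)  # TileLang needs sm_89+


def _use_triton_fp8_einsum() -> bool:
    # Widened gate: Ampere/Ada (sm_8x) and SM12x both take the Triton path,
    # since the DeepGEMM equivalent only covers Hopper / datacenter Blackwell.
    major, _minor = torch.cuda.get_device_capability()
    return major in (8, 12)
```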
Independent validation on dual DGX Spark GB10 (SM 12.1a, 121 GiB UMA each), TP=2 over QSFP RDMA, against PR head Quant: Validated graphs-ON (no
Mini-suite at 256K × 2 graphs-ON: 10 / 10 PASS (smoke 4/4 incl. tool-calling, generation 3 prompts × non-thinking + think-high). Two observations:
Findings doc + raw evidence: Thank you for the SM12x work — this is the first config that actually serves long-context DSV4-Flash on consumer Blackwell.
zyongye
left a comment
First of all, thank you for your contribution.
Overall I think this PR is not ready to merge yet. There are too many changes unrelated to the model enablement, and the tests are not well structured and only test meaningless things. I suggest we do some cleanup before the next round of review.
Regarding changes to the core files, the only change we expect is adding new kernels and branching out in the necessary places. All kernel implementations should live in separate files so that we can keep the core files clean (e.g. deepseek_v4_attention.py, sparse_attn_indexer.py). A similar suggestion was also made for the AMD enablement.
I do suggest taking a look at the AMD enablement to see whether the kernels can be shared from there.
    get_paged_mqa_logits_metadata,
)
from vllm.utils.deep_gemm import (
    fp8_fp4_paged_mqa_logits as fp8_paged_mqa_logits,
Why do we want to change this? I'd prefer to keep the original name.
)


def _make_mega_moe_config(
If we're solely testing backend selection, then we don't need to include this in the tests.
I don't think this is needed either.
I suggest adding another branch in forward_cuda and creating a separate function just for that; see forward_hip for reference.
from vllm.config import VllmConfig, get_current_vllm_config
from vllm.distributed import (
    get_ep_group,
    get_pp_group,
PP support should come in a different PR.
_SM120_PAGED_MQA_TOPK_CHUNK_SIZE = 8192


def _fp8_mqa_logits_head_chunk_size(
All non-DeepGEMM library operations should move to a different file; this file is solely for interfacing with DeepGEMM.
    and is_triton_sparse_mla_enabled_for_platform()
    and not triton_sparse_mla_cudagraphs_allowed(vllm_config)
):
    return AttentionCGSupport.NEVER
CUDA graph support should come with this PR.
    model_version="deepseek_v4",
)


def _forward_sparse_mla_swa_decode_triton(
I'm closing this one in favor of #41834 |
This PR incorporates #40929 and is now DeepGEMM-free, thanks to @bbbearxyz!
UPDATE: To better align with the DeepSeek official API and the B200 code path, I made a harness to help measure correctness, performance, and quality: https://github.com/jasl/vllm-ds4-sm120-harness
I will post the latest report there for people to review.
Summary
This PR enables DeepSeek V4 Flash to serve on NVIDIA SM12x GPUs, tested on a
2x RTX PRO 6000 Blackwell Workstation Edition host.
The important change from the earlier prototype is that this PR no longer pins
or rewrites the DeepGEMM dependency. The branch keeps vLLM's upstream DeepGEMM
installer and CMake metadata intact, and implements the required SM12x runtime
fallbacks in vLLM:
- fp8_ds_mla sparse MLA cache handling.
Motivation
DeepSeek V4 currently relies on kernels that are available on Hopper and
datacenter Blackwell paths, but not on SM120 / SM121 workstation and consumer
Blackwell GPUs. In particular, SM12x cannot directly reuse SM90 WGMMA kernels
or SM100 tcgen05 kernels.
This PR adds correctness-first portable kernels for the missing SM12x pieces,
then optimizes the hot sparse MLA paths enough for real serving. The result is
a reviewable vLLM-side compatibility layer that does not require maintainers to
accept a temporary DeepGEMM fork pin.
Scope
Included:
- fp8_ds_mla packed cache decode for SWA and compressed sparse candidates.
- model path.
Not included:
- broad for this PR.
Runtime controls
The SM12x sparse MLA path registers its environment variables in vllm.envs, so users should not see unknown-variable warnings for these knobs.
- VLLM_TRITON_MLA_SPARSE: 1 forces the Triton sparse MLA path, 0 disables it. When unset, vLLM enables it on SM12x where FlashMLA sparse is unavailable.
- VLLM_TRITON_MLA_SPARSE_TOPK_CHUNK_SIZE: default 512.
- VLLM_TRITON_MLA_SPARSE_QUERY_CHUNK_SIZE: default 256.
- VLLM_TRITON_MLA_SPARSE_HEAD_BLOCK_SIZE: supported values 1, 2, and 4; benchmarks used 4.
- VLLM_TRITON_MLA_SPARSE_MATMUL_DECODE
- VLLM_TRITON_MLA_SPARSE_ALLOW_CUDAGRAPH: 1 forces allow, 0 disables.
Operational warning: do not set PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True with the TP=2 CUDA graph configuration used below. In local testing it made custom all-reduce fail during CUDA graph address registration. Leaving it unset avoids that failure.
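For reference, a minimal sketch of how these knobs might be read. The authoritative definitions live in vllm.envs per the description above; the defaults shown mirror the values quoted in the list and are otherwise assumptions.

```python
import os

# Sketch only; see vllm.envs for the authoritative definitions.
force_triton_sparse = os.getenv("VLLM_TRITON_MLA_SPARSE")  # "1" force, "0" disable, unset = auto on SM12x
topk_chunk_size = int(os.getenv("VLLM_TRITON_MLA_SPARSE_TOPK_CHUNK_SIZE", "512"))
query_chunk_size = int(os.getenv("VLLM_TRITON_MLA_SPARSE_QUERY_CHUNK_SIZE", "256"))
head_block_size = int(os.getenv("VLLM_TRITON_MLA_SPARSE_HEAD_BLOCK_SIZE", "4"))  # 1, 2, or 4
allow_cudagraph = os.getenv("VLLM_TRITON_MLA_SPARSE_ALLOW_CUDAGRAPH")  # "1" force allow, "0" disable
```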
Branches
Formal PR branch:
Preview / evaluation branch with extra community performance work and MTP fixes:
The preview branch is not intended as the review target. It exists so users can
try the broader optimization stack while this PR stays focused.
Test environment
Hardware:
Software:
Benchmark environment:
Validation
Formal PR branch checks:
Result:
Compile check:
Targeted tests:
Result:
Diff hygiene:
Result: clean.
Preview branch focused checks:
Result:
Serving command
Formal PR branch, no MTP:
Preview branch, MTP:
Benchmark command
The short-context benchmark uses 128 -> 512; the long-context benchmark uses 8192 -> 512. Each row uses 48 prompts and temperature=0.
Formal PR branch benchmark
Branch:
Server memory setting:
MTP is not included in this branch. Starting the formal branch with --speculative-config '{"method":"mtp","num_speculative_tokens":2}' fails because the MTP fix stack is intentionally kept separate.
Result directory:
Preview branch benchmark
Branch:
Server memory setting:
This branch includes the separate MTP fixes and community performance patches.
It is for evaluation only, not the formal PR review target.
Startup notes:
Result directory:
Review notes
Changes made before this update:
clearly separated from serving kernels.
Known follow-ups
- ds4-sm120-full can continue to carry community performance patches for public evaluation.
- indexer, MoE, collectives, sampling, and sparse MLA rather than broadening this PR.